GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis
Not available
Abstract
Lower-upper (LU) factorization for sparse matrices is the most important computing step in circuit simulation. However, parallelizing LU factorization on graphics processing units (GPUs) turns out to be a difficult problem due to intrinsic data dependences and irregular memory accesses, which diminish GPU computing power. In this paper, we propose a new sparse LU solver on GPUs for circuit simulation and more general scientific computing. The new method, called the GPU-accelerated LU factorization (GLU) solver, is based on a hybrid right-looking LU factorization algorithm for sparse matrices. We show that, on GPU platforms, more concurrency can be exploited in the right-looking method than in the left-looking method, which is more popular for circuit analysis. At the same time, GLU preserves the benefits of the column-based left-looking LU method, such as symbolic analysis and column-level concurrency. We show that the resulting parallel GPU LU solver allows parallelization of all three loops of LU factorization on GPUs, whereas the existing GPU-based left-looking LU factorization approach can parallelize only two loops. Experimental results show that the proposed GLU solver delivers 5.71× and 1.46× speedup over the single-threaded and 16-threaded PARDISO solvers, respectively, 19.56× speedup over the KLU solver, 47.13× over the UMFPACK solver, and 1.47× speedup over a recently proposed GPU-based left-looking LU solver on a set of typical circuit matrices from the University of Florida (UFL) sparse matrix collection. Furthermore, on a set of general matrices from the UFL collection, GLU achieves 6.38× and 1.12× speedup over the single-threaded and 16-threaded PARDISO solvers, respectively, 39.39× speedup over the KLU solver, 24.04× over the UMFPACK solver, and 2.35× speedup over the same GPU-based left-looking LU solver.
In addition, a comparison on self-generated RLC mesh networks shows a similar trend, which further validates the advantage of the proposed method over the existing sparse LU solvers. The proposed architecture of this paper analyzes the logic size, area, and power consumption using Xilinx 14.2.

The Master of IEEE Projects. Copyright © 2016 LeMenizInfotech. All rights reserved. LeMenizInfotech, 36, 100 Feet Road, Natesan Nagar, Near Indira Gandhi Statue, Pondicherry-605 005. Call: 0413-4205444, +91 9566355386, 99625 88976. Web: www.lemenizinfotech.com / www.ieeemaster.com. Mail: [email protected]

Enhancement of the project:

Existing System:

Before presenting the new approach, we first review the two mainstream LU factorization methods: 1) the left-looking G/P factorization algorithm [13] and 2) a variant of the right-looking algorithms, such as the Gaussian elimination method. We then review some recent work on LU factorization on GPUs and the NVIDIA CUDA programming system. The LU factorization of an n × n matrix A has the form A = LU, where L is a lower triangular matrix and U is an upper triangular matrix. For a full matrix, LU factorization has O(n³) complexity, as it consists of three nested loops.

GPU Architecture and CUDA Programming

In this section, we review the GPU architecture and CUDA programming. CUDA is the parallel programming model for NVIDIA's general-purpose GPUs. The architecture of a typical CUDA-capable GPU consists of an array of highly threaded streaming multiprocessors (SMs) together with a large amount of DRAM, referred to as global memory. Take the Tesla C2070 GPU, for example: it contains 14 SMs, each of which has 32 streaming processors (SPs, or CUDA cores in NVIDIA's terminology), four special function units (SFUs), and its own shared memory/L1 cache. The structure of an SM is shown in Fig. 1.
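To make the O(n³) triple-loop structure mentioned above concrete, the following minimal C sketch performs an in-place right-looking (Gaussian-elimination-style) LU factorization of a small dense matrix, without pivoting. The matrix size and data are illustrative only; this is not the paper's sparse GLU algorithm, just the textbook dense kernel it generalizes.

```c
#define N 3

/* In-place right-looking LU factorization (no pivoting) of an N x N
 * dense matrix. Afterwards the upper triangle of a holds U and the
 * strict lower triangle holds L (unit diagonal implied). The three
 * nested loops give the O(n^3) cost noted above. */
void lu_factorize(double a[N][N]) {
    for (int k = 0; k < N; k++) {            /* pivot column */
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];              /* multiplier L(i,k) */
            for (int j = k + 1; j < N; j++)  /* update trailing submatrix */
                a[i][j] -= a[i][k] * a[k][j];
        }
    }
}
```

In a right-looking formulation, the innermost update touches the whole trailing submatrix at step k, which is precisely the source of the extra concurrency the paper exploits on GPUs.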
Fig. 1. Diagram of an SM in the NVIDIA Tesla C2070 (SP, L/S for load/store unit, and SFU).

As the programming model of the GPU, CUDA extends C into CUDA C and supports tasks such as thread invocation and memory allocation, enabling programmers to exploit most of the GPU's parallel capabilities. In the CUDA programming model, shown in Fig. 2, threads are organized into blocks, and blocks of threads are organized into grids. CUDA also assumes that the host (CPU) and the device (GPU) maintain their own separate memory spaces, referred to as host memory and device memory, respectively.

Fig. 2. Programming model of CUDA.

Disadvantages:
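As a small illustration of the grid/block/thread hierarchy described in the CUDA programming model above: inside a kernel, each thread derives a unique flattened 1-D global index as `blockIdx.x * blockDim.x + threadIdx.x`. The host-side C helper below emulates that arithmetic (the launch dimensions are hypothetical, not taken from the paper).

```c
/* Emulates the 1-D global-index computation a CUDA thread performs:
 * global = blockIdx.x * blockDim.x + threadIdx.x.
 * E.g., thread 5 of block 2 in a launch with 256 threads per block
 * handles global element 2*256 + 5 = 517. */
int global_thread_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```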
Similar Articles
Parallel Incomplete-LU and Cholesky Factorization in the Preconditioned Iterative Methods on the GPU
A novel algorithm for computing the incomplete-LU and Cholesky factorization with 0 fill-in on a graphics processing unit (GPU) is proposed. It implements the incomplete factorization of the given matrix in two phases. First, the symbolic analysis phase builds a dependency graph based on the matrix sparsity pattern and groups the independent rows into levels. Second, the numerical factorization...
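The symbolic phase described above, which groups independent rows into levels, can be sketched as follows. This is a hypothetical serial illustration, not the cited paper's implementation: `dep_ptr`/`dep_idx` list, in CSR style, the earlier rows each row depends on, and a row's level is one more than the maximum level of its dependencies.

```c
/* Level scheduling: assign each row of a sparse triangular factor a
 * level such that all rows in the same level are mutually independent
 * and can be factorized/solved in parallel. Rows must be visited in
 * order, since dependencies point only to earlier rows. */
void build_levels(int n, const int *dep_ptr, const int *dep_idx,
                  int *level_of) {
    for (int i = 0; i < n; i++) {
        int lvl = 0;                        /* rows with no deps: level 0 */
        for (int p = dep_ptr[i]; p < dep_ptr[i + 1]; p++) {
            int j = dep_idx[p];             /* row i depends on row j < i */
            if (level_of[j] + 1 > lvl)
                lvl = level_of[j] + 1;
        }
        level_of[i] = lvl;  /* rows sharing a level form one parallel batch */
    }
}
```

The numerical phase then processes the levels in order, launching all rows of one level concurrently.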
Full text

Parallel LU Factorization on GPU Cluster
This paper describes our progress in developing software for performing parallel LU factorization of a large dense matrix on a GPU cluster. Three approaches, with increasing software complexity, are considered: (i) a naive “thunking” approach that links the existing parallel ScaLAPACK software library with cuBLAS through a software emulation layer; (ii) a more intrusive magmaBLAS implementation...
Full text

A CPU-GPU hybrid approach for the unsymmetric multifrontal method
The multifrontal method is an efficient direct method for solving large-scale sparse and unsymmetric linear systems. The method transforms a large sparse matrix factorization process into a sequence of factorizations involving smaller dense frontal matrices. Some of these dense operations can be accelerated by using a graphics processing unit (GPU). We analyze the unsymmetric multifrontal method from both an ...
Full text

Fast radix sort for sparse linear algebra on GPU
Fast sorting is an important step in many parallel algorithms, which require data ranking, ordering, or partitioning. Parallel sorting is a widely researched subject, and many algorithms were developed in the past. In this paper, the focus is on implementing highly efficient sorting routines for sparse linear algebra operations, such as parallel sparse matrix-matrix multiplication, or factor...
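Radix sort of the kind mentioned above can be sketched serially as a sequence of stable counting-sort passes, one per 8-bit digit. The C version below is a minimal illustration under assumed constraints (non-negative 32-bit keys, small arrays); the GPU version the paper targets parallelizes exactly these counting and scatter passes.

```c
#include <string.h>

/* LSD radix sort of n non-negative 32-bit keys (n <= 64 in this
 * sketch), 8 bits per pass: count digit occurrences, prefix-sum the
 * counts into starting offsets, then scatter stably. */
void radix_sort(unsigned *a, int n) {
    unsigned tmp[64];                               /* fixed scratch */
    for (int shift = 0; shift < 32; shift += 8) {
        int count[257] = {0};
        for (int i = 0; i < n; i++)                 /* histogram pass */
            count[((a[i] >> shift) & 0xff) + 1]++;
        for (int b = 0; b < 256; b++)               /* exclusive prefix sum */
            count[b + 1] += count[b];
        for (int i = 0; i < n; i++)                 /* stable scatter */
            tmp[count[(a[i] >> shift) & 0xff]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
}
```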
Full text

Parallel Triangular Solvers on GPU
In this paper, we investigate GPU based parallel triangular solvers systematically. The parallel triangular solvers are fundamental to incomplete LU factorization family preconditioners and algebraic multigrid solvers. We develop a new matrix format suitable for GPU devices. Parallel lower triangular solvers and upper triangular solvers are developed for this new data structure. With these solv...
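The core kernel such triangular solvers parallelize is forward substitution. As a baseline, the serial C sketch below solves Lx = b for a sparse lower-triangular L in CSR format; it is an illustrative assumption (not the cited paper's custom format) that each row's diagonal entry is stored last.

```c
/* Serial sparse lower-triangular solve Lx = b, with L in CSR
 * (rowptr/col/val) and the diagonal stored as the last entry of each
 * row. Level scheduling lets independent rows of this loop run in
 * parallel on the GPU. */
void sptrsv_lower(int n, const int *rowptr, const int *col,
                  const double *val, const double *b, double *x) {
    for (int i = 0; i < n; i++) {
        double s = b[i];
        int p;
        for (p = rowptr[i]; p < rowptr[i + 1] - 1; p++)
            s -= val[p] * x[col[p]];   /* subtract already-solved terms */
        x[i] = s / val[p];             /* divide by diagonal L(i,i) */
    }
}
```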
Full text